[core] Move stream reconnect logic to getReadable level by VaguelySerious · Pull Request #1847 · vercel/workflow

VaguelySerious · 2026-04-23T22:55:57Z

Moves stream reconnect handling out of the world-vercel adapter and up to the getReadable/core level, where chunk framing already lives — so reconnect works the same way across world adapters.

Reverts #1790 (the adapter-level control-frame approach). The reconnecting reader counts the 4-byte length-prefixed frames it has received and, on a connection error, reopens the stream from startIndex + framesConsumed. A clean end-of-stream is treated as completion (no reconnect). Object/serialized streams only — raw byte streams have no wire framing to count and are opted out (the caller owns its own reconnect strategy). Bounded by a consecutive-failure cap (reset on forward progress) plus an absolute total-reconnect backstop.
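A minimal sketch of the frame counting described above, not the PR's actual code. `FrameCounter` and its method names are hypothetical, and the big-endian byte order of the 4-byte length prefix is an assumption; the description only says the frames are length-prefixed.

```typescript
// Hypothetical helper: counts complete [4-byte length][payload] frames.
// Assumption: the length prefix is an unsigned big-endian 32-bit integer.
class FrameCounter {
  private buffer = new Uint8Array(0);
  framesConsumed = 0;

  // Feed raw bytes as they arrive; returns the payloads of every frame
  // completed by this chunk. Frames may be split across chunks.
  push(chunk: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.buffer.length + chunk.length);
    merged.set(this.buffer, 0);
    merged.set(chunk, this.buffer.length);

    const frames: Uint8Array[] = [];
    let offset = 0;
    while (merged.length - offset >= 4) {
      const view = new DataView(merged.buffer, merged.byteOffset + offset, 4);
      const payloadLength = view.getUint32(0, false);
      if (merged.length - offset - 4 < payloadLength) break; // incomplete frame
      frames.push(merged.slice(offset + 4, offset + 4 + payloadLength));
      offset += 4 + payloadLength;
      this.framesConsumed++;
    }
    this.buffer = merged.slice(offset); // keep the partial frame, if any
    return frames;
  }

  // Called after a connection error: drop the partial frame and compute
  // the index to reopen from (startIndex + framesConsumed).
  resumeIndex(startIndex: number): number {
    this.buffer = new Uint8Array(0);
    return startIndex + this.framesConsumed;
  }
}
```

On a connection error the caller would reopen the stream at `resumeIndex(startIndex)`. Dropping the partial frame is safe only because the server is expected to resend that frame whole from the resume index.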

Closes #1801
Closes #1802

After shipping this

Forward-ported to main in #2318. See the cross-PR comment for merge order — this only takes effect once paired with the coordinated server-side change that errors a timed-out stream connection instead of closing it cleanly.

…nection (#1790)" This reverts commit 5ef9ac2.

Signed-off-by: Peter Wielander <mittgfu@gmail.com>

changeset-bot · 2026-04-23T22:56:01Z

🦋 Changeset detected

Latest commit: 4a0258b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 17 packages

Name	Type
@workflow/world-vercel	Patch
@workflow/core	Patch
@workflow/cli	Patch
@workflow/web	Patch
@workflow/builders	Patch
@workflow/next	Patch
@workflow/nitro	Patch
@workflow/vitest	Patch
@workflow/web-shared	Patch
workflow	Patch
@workflow/world-testing	Patch
@workflow/astro	Patch
@workflow/nest	Patch
@workflow/rollup	Patch
@workflow/sveltekit	Patch
@workflow/vite	Patch
@workflow/nuxt	Patch


vercel · 2026-04-23T22:56:02Z


Project	Deployment	Actions	Updated (UTC)
example-nextjs-workflow-turbopack	Ready	Preview, Comment	Jun 10, 2026 8:12pm
example-nextjs-workflow-webpack	Ready	Preview, Comment	Jun 10, 2026 8:12pm
example-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-astro-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-express-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-fastify-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-hono-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-nitro-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-nuxt-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-sveltekit-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-tanstack-start-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workbench-vite-workflow	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workflow-docs	Ready	Preview, Comment, Open in v0	Jun 10, 2026 8:12pm
workflow-swc-playground	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workflow-tarballs	Ready	Preview, Comment	Jun 10, 2026 8:12pm
workflow-web	Ready	Preview, Comment	Jun 10, 2026 8:12pm

github-actions · 2026-04-23T23:02:33Z

🧪 E2E Test Results

❌ Some tests failed

Summary

Category	Passed	Failed	Skipped	Total
❌ ▲ Vercel Production	922	1	67	990
✅ 💻 Local Development	994	0	86	1080
✅ 📦 Local Production	994	0	86	1080
✅ 🐘 Local Postgres	994	0	86	1080
✅ 🪟 Windows	90	0	0	90
❌ 🌍 Community Worlds	130	92	6	228
✅ 📋 Other	504	0	36	540
Total	4628	93	367	5088

❌ Failed Tests

▲ Vercel Production (1 failed)

vite (1 failed):

sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KTTRX1EEHJT9R3HNMTBPBJGD | 🔍 observability

🌍 Community Worlds (92 failed)

mongodb (14 failed):

hookWorkflow is not resumable via public webhook endpoint | wrun_01KTTRFJMZ6V3RDX89MZSNZ9XY
webhookWorkflow | wrun_01KTTRFTM23WFNE06BANHG475K
sleepingWorkflow | wrun_01KTTRG1FYXAJ3MN4987EH643G
outputStreamWorkflow no startIndex (reads all chunks)
outputStreamWorkflow negative startIndex (reads from end)
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KTTRJWG7WA0WFA05FVANG4Q2
writableForwardedFromWorkflowWorkflow | wrun_01KTTRK9M5K8W6DVSAAK6T9NJF
writableForwardedFromStepWorkflow | wrun_01KTTRKDQC4ZG1ETVGC88445V9
concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KTTRQHQA25MTBBNY59H41CCQ
pages router sleepingWorkflow via pages router
resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KTTRZ27MQK6QBMJZMAWBE8YD

redis (10 failed):

hookWorkflow | wrun_01KTTRFBT7E3JV0BECDRSVJRW9
hookWorkflow is not resumable via public webhook endpoint | wrun_01KTTRFJMZ6V3RDX89MZSNZ9XY
sleepingWorkflow | wrun_01KTTRG1FYXAJ3MN4987EH643G
outputStreamWorkflow negative startIndex (reads from end)
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KTTRQHQA25MTBBNY59H41CCQ
pages router sleepingWorkflow via pages router
resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KTTRZ27MQK6QBMJZMAWBE8YD

turso (68 failed):

addTenWorkflow | wrun_01KTTREEZXQEZ435BXJ3NGR42J
addTenWorkflow | wrun_01KTTREEZXQEZ435BXJ3NGR42J
wellKnownAgentWorkflow (.well-known/agent) | wrun_01KTTRESEA7AE2M3DWABXYSQR9
should work with react rendering in step
promiseAllWorkflow | wrun_01KTTRENRHWX6SECVCZXWTP0TN
promiseRaceWorkflow | wrun_01KTTREVE71KGCSNDACKXZ08G8
promiseAnyWorkflow | wrun_01KTTREXHA18JAQC5WD7DDP81M
importedStepOnlyWorkflow | wrun_01KTTRF5VM7A2TF17HTSDFWRDM
readableStreamWorkflow | wrun_01KTTREZRN5B2GEQDQHVSRXKWP
hookWorkflow | wrun_01KTTRFBT7E3JV0BECDRSVJRW9
hookWorkflow is not resumable via public webhook endpoint | wrun_01KTTRFJMZ6V3RDX89MZSNZ9XY
webhookWorkflow | wrun_01KTTRFTM23WFNE06BANHG475K
sleepingWorkflow | wrun_01KTTRG1FYXAJ3MN4987EH643G
parallelSleepWorkflow | wrun_01KTTRGHBTBRSZP6HN856RPZDB
nullByteWorkflow | wrun_01KTTRGMJYQ956R3YGC81PCNXV
workflowAndStepMetadataWorkflow | wrun_01KTTRGPS3XBF2E0PMDV3G4XJN
outputStreamWorkflow no startIndex (reads all chunks)
outputStreamWorkflow positive startIndex (skips first chunk)
outputStreamWorkflow negative startIndex (reads from end)
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KTTRJWG7WA0WFA05FVANG4Q2
writableForwardedFromWorkflowWorkflow | wrun_01KTTRK9M5K8W6DVSAAK6T9NJF
writableForwardedFromStepWorkflow | wrun_01KTTRKDQC4ZG1ETVGC88445V9
fetchWorkflow | wrun_01KTTRKHG3335J5GR1TEPN71Z2
promiseRaceStressTestWorkflow | wrun_01KTTRKTQR51GBWDHWWW595HZ1
error handling error propagation workflow errors nested function calls preserve message and stack trace
error handling error propagation workflow errors cross-file imports preserve message and stack trace
error handling error propagation step errors basic step error preserves message and stack trace
error handling error propagation step errors cross-file step error preserves message and function names in stack
error handling retry behavior regular Error retries until success
error handling retry behavior FatalError fails immediately without retries
error handling retry behavior RetryableError respects custom retryAfter delay
error handling retry behavior maxRetries=0 disables retries
error handling catchability FatalError can be caught and detected with FatalError.is()
error handling not registered WorkflowNotRegisteredError fails the run when workflow does not exist
error handling not registered StepNotRegisteredError fails the step but workflow can catch it
error handling not registered StepNotRegisteredError fails the run when not caught in workflow
hookCleanupTestWorkflow - hook token reuse after workflow completion | wrun_01KTTRQ5H4GSGSQ0QHYG90B9YH
concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KTTRQHQA25MTBBNY59H41CCQ
hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running | wrun_01KTTRR0DS6H5YS7WRGE4F9W6J
stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars) | wrun_01KTTRRH9MGK6JQ6DQHAE944VT
stepFunctionWithClosureWorkflow - step function with closure variables passed as argument | wrun_01KTTRRSZBKXGZQY2GG1YADFY1
closureVariableWorkflow - nested step functions with closure variables | wrun_01KTTRRZFNE0WS4358KN339TYS
spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step | wrun_01KTTRS1FTJ2V10Y4F2PYA5653
health check (queue-based) - workflow and step endpoints respond to health check messages
health check (CLI) - workflow health command reports healthy endpoints
pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KTTRSF7JD7RJQVGER1W9V5B3
Calculator.calculate - static workflow method using static step methods from another class | wrun_01KTTRSMM4T19KT3M2GT93CJNG
AllInOneService.processNumber - static workflow method using sibling static step methods | wrun_01KTTRSVAZF0HH71FFET205NE4
ChainableService.processWithThis - static step methods using this to reference the class | wrun_01KTTRT1SF7ZY0XBPGJYYN0K0D
thisSerializationWorkflow - step function invoked with .call() and .apply() | wrun_01KTTRT8CRJ8935ZV0ETXN7JR4
customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE | wrun_01KTTRTF4EJBR080XK1GPHAY20
instanceMethodStepWorkflow - instance methods with "use step" directive | wrun_01KTTRTNWZY6HY5554ZPEH61KR
crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context | wrun_01KTTRV20AMJDT7VBZYQH6DJC5
stepFunctionAsStartArgWorkflow - step function reference passed as start() argument | wrun_01KTTRVAVSQPRY81DYQKRY0DXQ
cancelRun - cancelling a running workflow | wrun_01KTTRVHA4PMW4MFRQHFPJM3B3
cancelRun via CLI - cancelling a running workflow | wrun_01KTTRVTM64YZFC25VZKKPN5NW
pages router addTenWorkflow via pages router
pages router promiseAllWorkflow via pages router
pages router sleepingWorkflow via pages router
hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep | wrun_01KTTRW6RN7M6MZHNPXJ96H6C6
sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration | wrun_01KTTRWP4J2JEJR8CTS98BAJYD
sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KTTRX1EEHJT9R3HNMTBPBJGD
importMetaUrlWorkflow - import.meta.url is available in step bundles | wrun_01KTTRYXV3NN19055TB2MGD052
metadataFromHelperWorkflow - getWorkflowMetadata/getStepMetadata work from module-level helper (#1577) | wrun_01KTTRYZZ2VQWJ547KQY598T9T
resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KTTRZ27MQK6QBMJZMAWBE8YD

Details by Category

❌ ▲ Vercel Production

App	Passed	Failed	Skipped
✅ astro	83	0	7
✅ example	83	0	7
✅ express	83	0	7
✅ fastify	83	0	7
✅ hono	83	0	7
✅ nextjs-turbopack	88	0	2
✅ nextjs-webpack	88	0	2
✅ nitro	83	0	7
✅ nuxt	83	0	7
✅ sveltekit	83	0	7
❌ vite	82	1	7

✅ 💻 Local Development

App	Passed	Skipped
✅ astro-stable	84	6
✅ express-stable	84	6
✅ fastify-stable	84	6
✅ hono-stable	84	6
✅ nextjs-turbopack-canary	71	19
✅ nextjs-turbopack-stable	90	0
✅ nextjs-webpack-canary	71	19
✅ nextjs-webpack-stable	90	0
✅ nitro-stable	84	6
✅ nuxt-stable	84	6
✅ sveltekit-stable	84	6
✅ vite-stable	84	6

✅ 📦 Local Production

App	Passed	Skipped
✅ astro-stable	84	6
✅ express-stable	84	6
✅ fastify-stable	84	6
✅ hono-stable	84	6
✅ nextjs-turbopack-canary	71	19
✅ nextjs-turbopack-stable	90	0
✅ nextjs-webpack-canary	71	19
✅ nextjs-webpack-stable	90	0
✅ nitro-stable	84	6
✅ nuxt-stable	84	6
✅ sveltekit-stable	84	6
✅ vite-stable	84	6

✅ 🐘 Local Postgres

App	Passed	Skipped
✅ astro-stable	84	6
✅ express-stable	84	6
✅ fastify-stable	84	6
✅ hono-stable	84	6
✅ nextjs-turbopack-canary	71	19
✅ nextjs-turbopack-stable	90	0
✅ nextjs-webpack-canary	71	19
✅ nextjs-webpack-stable	90	0
✅ nitro-stable	84	6
✅ nuxt-stable	84	6
✅ sveltekit-stable	84	6
✅ vite-stable	84	6

✅ 🪟 Windows

App	Passed	Failed	Skipped
✅ nextjs-turbopack	90	0	0

❌ 🌍 Community Worlds

App	Passed	Failed	Skipped
✅ mongodb-dev	3	0	2
❌ mongodb	57	14	0
✅ redis-dev	3	0	2
❌ redis	61	10	0
✅ turso-dev	3	0	2
❌ turso	3	68	0

✅ 📋 Other

App	Passed	Skipped
✅ e2e-local-dev-nest-stable	84	6
✅ e2e-local-dev-tanstack-start-stable	84	6
✅ e2e-local-postgres-nest-stable	84	6
✅ e2e-local-postgres-tanstack-start-stable	84	6
✅ e2e-local-prod-nest-stable	84	6
✅ e2e-local-prod-tanstack-start-stable	84	6


❌ Some E2E test jobs failed:

Vercel Prod: failure
Local Dev: success
Local Prod: success
Local Postgres: success
Windows: success

Check the workflow run for details.

⚠️ Community world tests failed (non-blocking):

Community Worlds: failure

Check the workflow run for details.

Signed-off-by: Peter Wielander <mittgfu@gmail.com>

TooTallNate

Review

The architectural shift makes sense: client-side frame counting is a cleaner abstraction than wire-level control frames, and moving it to core means it works for any world that returns a ReadableStream from readFromStream, not just world-vercel. The reconnect math, frame-counting, and partial-frame discard are all correct.

But there are two significant concerns I think need addressing before merge.

1. Byte streams lose auto-reconnect entirely

The PR explicitly opts byte streams out of reconnect:

if (value.type === 'bytes') {
  // No auto-reconnect here: raw byte streams have no wire framing
  const readable = new WorkflowServerReadableStream(value.name, value.startIndex);
  // ...
} else {
  const readable = createReconnectingFramedStream(value.name, value.startIndex);
  // ...
}

The reason given is technically correct (no wire framing → no chunk boundary detection client-side), but this is a regression vs. the reverted #1790, which handled byte streams just fine because the server sent the resume hint via control frame.

The use cases that lose auto-reconnect:

AI streaming responses (text/SSE) piped from getWritable()
Any HTTP route doing return new Response(run.getReadable()) for raw bytes
Any streaming workflow output that goes more than 2 minutes (the prior server-side timeout window) and uses byte type

The docs callout added by this PR points users to WorkflowChatTransport and supportsCancellation, but those address a different problem (cancellation, not reconnect). Pushing reconnect to the application layer — where every consumer has to reimplement it — is a step backward in usability.

Possible directions:

Frame byte streams on the writable side too (4 bytes per chunk overhead) so createReconnectingFramedStream works for them. The user-facing surface stays raw bytes; only the wire format changes.
Keep the control-frame approach for byte streams only as a hybrid — frame counting for non-byte streams, server-side hint for byte streams.
Document this as an explicit limitation and update the docs callout to specifically warn about byte streams losing reconnect, not just talk about supportsCancellation (separate issue).

(1) seems best to me — it removes the asymmetry entirely and keeps the cleaner architecture.

2. The "clean EOF means done" assumption needs verification

if (result.done || !result.value) {
  // Clean EOF — stream is truly complete...
  controller.close();
  return;
}

This assumes the workflow-server signals "done" and "timeout/aborted" differently at the network level — clean done = FIN, timeout = error/reset. The deleted control-frame logic disambiguated these because both manifested as clean closes from a TCP perspective; the magic-footer frame was the disambiguator.

Without the control frame, the new code can't tell them apart. If the workflow-server's 2-minute timeout sends a clean FIN (rather than a TCP reset or stream error), this PR will appear to "complete" any stream that hits 2 minutes.

Is that assumption verified against the actual server behavior? The new test simulates max-duration as controller.error(...), which is fine for the unit test, but I'd want to see either:

An e2e test confirming a real long-lived stream against workflow-server triggers reconnect (not premature close)
A statement in the PR description / commit explaining why the server-side timeout is now an error not a clean close (was the workflow-server changed? was the timeout removed?)

The supportsCancellation callout suggests the architecture has shifted such that streams now run for the full function maxDuration rather than the old 2-minute server timeout — but if so, that's a precondition for this PR and worth calling out explicitly.
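To make the concern concrete, here is a small sketch of how a reader loop can only tell the two endings apart if the server errors the stream. `ChunkReader`, `Outcome`, and `drain` are hypothetical names for illustration and are not taken from the PR.

```typescript
// Minimal reader shape, modeled on what reader.read() resolves to.
type ReadResult = { done: boolean; value?: Uint8Array };
interface ChunkReader {
  read(): Promise<ReadResult>;
}

type Outcome =
  | { kind: "complete" }
  | { kind: "interrupted"; error: unknown };

// Reads until the stream ends. A rejected read() is the only signal that
// maps to "interrupted" (and therefore to a reconnect). A clean end maps
// to "complete", whether or not the server actually sent everything.
async function drain(
  reader: ChunkReader,
  onChunk: (chunk: Uint8Array) => void
): Promise<Outcome> {
  for (;;) {
    let result: ReadResult;
    try {
      result = await reader.read();
    } catch (error) {
      return { kind: "interrupted", error };
    }
    if (result.done || !result.value) {
      return { kind: "complete" };
    }
    onChunk(result.value);
  }
}
```

If the server's timeout closes the connection cleanly, this loop returns `complete` and the remaining chunks are silently lost, which is exactly the case the removed control frame used to catch.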

Minor

See inline comments.

What looks good

Frame-counting math is correct: currentStartIndex += consumedFrames resumes at the right place, partial-frame buffer is correctly discarded, the math is symmetric for non-zero initial startIndex.
Negative startIndex correctly bypasses reconnect with a clear reason (can't compute absolute resume index without a tail-index lookup) — and there's a test for it.
AbortController plumbing in world-vercel readFromStream is the right primitive. Cancel propagation through cancel(reason) { abortController.abort(reason) } correctly tears down the fetch.
Test coverage for createReconnectingFramedStream is good — frames split across reads, partial frame at error, clean EOF, non-zero initial startIndex, negative startIndex bypass, cancel propagation. Six tests, all targeted.
Two changesets correctly scoped: @workflow/core for the new wrapper, @workflow/world-vercel for the cancel propagation.

TooTallNate · 2026-04-25T00:05:09Z

-        value.name,
-        value.startIndex
-      );
      if (value.type === 'bytes') {


Byte streams are intentionally opted out of auto-reconnect here. This is a behavioral regression vs. the reverted #1790, which handled byte streams via server-sent control frames.

The comment correctly identifies why this is hard (no wire framing → no chunk boundary detection client-side), but pushing reconnect to the application layer means:

Every consumer of run.getReadable() for byte streams (AI text streaming, raw HTTP responses, etc.) has to implement its own reconnect logic.

The docs callout added by this PR (about supportsCancellation) doesn't actually help — that's a cancellation fix, not a reconnect fix.

I think the right move is to frame byte streams on the writable side too (4 bytes per chunk overhead), so createReconnectingFramedStream can be used uniformly. The user-facing API stays raw bytes; only the wire format gets the length prefix. That removes the asymmetry and keeps the cleaner architecture this PR is trying to achieve.

TooTallNate · 2026-04-25T00:05:09Z

+ * the writable buffers one frame per chunk when multi-writing). The wrapper
+ * counts completed frames and, on upstream error, reopens the connection
+ * with `startIndex = resolvedStartIndex + consumedFrames`. Partial-frame
+ * bytes buffered before the cut are discarded — the server will resend the


The comment says: "On serverfull backends, reconnects should only happen during transient errors. For serverless backends, we set this constant so that we cover at least 10 minutes even if the server would be limited to e.g. 1 minute per session."

10 reconnects × 1-minute-per-session = 10 minutes covered. That's tighter than the deleted constant in world-vercel (MAX_RECONNECTS = 50, ~100 minutes coverage at 2-min server timeouts). If the underlying assumption is that streams now run for full function maxDuration (which on Pro/Enterprise can exceed 10 minutes), this cap may be too low.

Worth either:

Bumping the constant to match the longest realistic maxDuration (~15 min Pro), so something like 30, or

Making it configurable per-call (or via the world)
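The arithmetic behind this comment, as a quick sketch. The function names are made up for illustration, the 15-minute figure is the rough Pro estimate from above, and the calculation follows the comment in not counting the initial session.

```typescript
// Minutes of streaming covered if every session is cut at the limit
// and the reconnect cap is never reset.
function reconnectCoverageMinutes(
  maxReconnects: number,
  sessionMinutes: number
): number {
  return maxReconnects * sessionMinutes;
}

// Reconnects needed to cover a given total duration.
function reconnectsNeeded(
  maxDurationMinutes: number,
  sessionMinutes: number
): number {
  return Math.ceil(maxDurationMinutes / sessionMinutes);
}

console.log(reconnectCoverageMinutes(10, 1)); // 10  (new constant)
console.log(reconnectCoverageMinutes(50, 2)); // 100 (deleted world-vercel constant)
console.log(reconnectsNeeded(15, 1)); // 15
```

With 1-minute sessions a 15-minute stream needs 15 reconnects, so a value like 30 leaves roughly 2x headroom.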

TooTallNate · 2026-04-25T00:05:09Z

+          console.warn("Error closing ReadableStream reader:", err)
+        });
+        reader = undefined;
+      }


Nit: cancel() here only cancels the active reader. There's a small race window: if cancel fires while connect() is in flight (between reader = undefined after a reconnect-triggering error and the new reader being assigned), there's nothing to cancel — the new connection completes and the loop continues reading.

A cancelled flag checked at the top of the pull loop and inside connect() would close this. Same race existed in the deleted world-vercel cancel handler, so it's not a regression — just worth tightening if you're touching this code.

let cancelled = false;
// ... in pull loop, top of for(;;):
if (cancelled) { controller.close(); return; }
// ... in cancel:
cancelled = true;
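A slightly fuller sketch of the same idea, assuming a simplified source with a `connect()` callback. `ChunkReader` and `createCancellableSource` are hypothetical names, not the PR's code. The key point is the second `cancelled` check right after `connect()` resolves, which covers a cancel that arrives while the connection is still being opened.

```typescript
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
  cancel(): Promise<void>;
}

function createCancellableSource(connect: () => Promise<ChunkReader>) {
  let cancelled = false;
  let reader: ChunkReader | undefined;

  return {
    // Returns the next chunk, or undefined once the source is finished
    // or has been cancelled.
    async next(): Promise<Uint8Array | undefined> {
      if (cancelled) return undefined;
      if (!reader) {
        const fresh = await connect();
        if (cancelled) {
          // cancel() ran while connect() was in flight: tear down the
          // connection that was just opened instead of reading from it.
          await fresh.cancel();
          return undefined;
        }
        reader = fresh;
      }
      const result = await reader.read();
      return result.done ? undefined : result.value;
    },

    async cancel(): Promise<void> {
      cancelled = true;
      if (reader) {
        await reader.cancel();
        reader = undefined;
      }
    },
  };
}
```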

TooTallNate · 2026-04-25T00:05:09Z

+    const { world } = makeWorldWithScriptedStreams({
+      0: () =>
+        scriptedStream([
+          // Split frame into 3 byte-level reads to prove boundary-aware


Test simulates max-duration abort as controller.error(...) — which is correct for what the wrapper sees on a network reset, but doesn't verify the actual workflow-server behavior matches.

If workflow-server's stream timeout sends a clean FIN (i.e., calls controller.close() on its end) instead of an error, this code path will treat it as EOF and not reconnect. The control-frame logic that this PR removes was specifically designed to disambiguate these two cases.

Could you confirm in the PR description whether:

workflow-server's stream timeout has been removed entirely (streams now run for full function maxDuration), OR

the timeout still exists but now manifests as a network error / TCP reset rather than a clean FIN?

This is the load-bearing assumption of the whole design.

TooTallNate · 2026-04-25T06:43:11Z

Following up after the discussion thread — consolidating the recommended direction so it's all in one place.

Recommended direction

Move byte-stream framing into core, gated on a per-run feature flag, with the resolved choice baked into the serialized stream ref.

The PR's instinct (move reconnect to core) is right. The concrete change to make it work uniformly for byte streams:

1. Frame byte streams on the writer side

In serialization.ts, the byte-stream branch of the ReadableStream reducer currently does:

ops.push(value.pipeTo(writable));

It would become:

ops.push(
  value
    .pipeThrough(getByteFramingStream())  // wrap each chunk in [4-byte len][bytes]
    .pipeTo(writable)
);

Cost: 4 bytes per server-side chunk, so the relative overhead for an n-byte chunk is 4/n. That is 5% at 80 bytes and negligible for structured byte payloads in the KB+ range; AI text chunks of only a few dozen bytes pay proportionally more (a 24-byte chunk carries about 17%).
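For illustration, a per-chunk framing function of the kind `getByteFramingStream()` would need. This is a sketch: `frameChunk` and `framingOverhead` are hypothetical names, and the big-endian length encoding is an assumption rather than something stated in this thread.

```typescript
// Wraps one chunk as [4-byte length][bytes].
// Assumption: the length is an unsigned big-endian 32-bit integer.
function frameChunk(chunk: Uint8Array): Uint8Array {
  const framed = new Uint8Array(4 + chunk.length);
  new DataView(framed.buffer).setUint32(0, chunk.length, false);
  framed.set(chunk, 4);
  return framed;
}

// Relative cost of the 4-byte prefix for a chunk of the given size.
function framingOverhead(chunkBytes: number): number {
  return 4 / chunkBytes;
}

console.log(Array.from(frameChunk(new Uint8Array([104, 105])))); // [0, 0, 0, 2, 104, 105]
console.log(framingOverhead(80)); // 0.05
```

In a stream pipeline this function would sit inside the `transform` callback of a `TransformStream`, enqueueing `frameChunk(chunk)` for each incoming chunk.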

2. Use `createReconnectingFramedStream` for both branches on the reader side

The non-byte branch already does. The byte branch additionally pipes through an unframing transform that strips the 4-byte length and emits raw bytes to a type: 'bytes' WHATWG stream — preserving the user-facing API exactly as it is today.

3. WHATWG `type: 'bytes'` semantics are unaffected

To clarify a point that came up in the discussion: WHATWG's type: 'bytes' is purely about the reader-side API (BYOB readers, Uint8Array chunks, optional autoAllocateChunkSize). The spec says nothing about wire format or chunk-boundary semantics. Whether the bytes are framed on the wire is a transport choice the SDK gets to make — it doesn't change what the user sees from getReader().

So the framing change is purely internal to serialization. User-facing API is identical.

Backwards compatibility

This is the load-bearing concern, since byte-stream wire format becomes a versioning surface.

Cross-version exposures (post-version-skew-protection)

Within a single run: no exposure. Workflow runs are pinned to one deployment, so all chunks of any stream within a run are written and read by the same SDK version.

The only real exposures are streams that cross the run boundary via hook payloads, where the producer and consumer can be different SDK versions:

Newer caller → older run (resumeHook(token, { stream: writable }) where the older run writes to it): older writer can't frame, newer reader must accept raw.
Newer caller → older run (resumeHook(token, { stream: readable }) where the older run reads from it): older reader can't unframe, newer writer must produce raw.
Older caller → newer run: mirror cases — newer side must defer to the older side's format.

In all cases, the framing decision must be made at the producer side based on the consumer side's capability.

Proposed mechanism

Per-run feature flag in run.features, e.g. 'byte-stream-framing'. Set at run-creation time based on the SDK version of the run's pinned deployment.
NOT specVersion: that's reserved for World-protocol changes (queue transport, event schemas). Byte-stream framing is purely a core/serialization concern that worlds don't need to know about. Features are the right granularity.
Reducer resolves at serialization time: the ReadableStream / WritableStream reducer looks up the target run's features and decides framing. For hook payloads the target is the hook's owning run (already looked up by the resumeHook code path); for same-run streams the target is the current run.

Bake the resolved choice into the stream ref:

ReadableStream:
  | { name: string; type?: 'bytes'; startIndex?: number; framing?: 'raw' | 'framed-v1' }
  | { bodyInit: any };

Reader dispatches on the ref field: framing === 'framed-v1' → use createReconnectingFramedStream + unframing transform; framing === undefined | 'raw' → use existing WorkflowServerReadableStream (no reconnect).
Default is raw: absence of the field means raw, so existing serialized refs from older SDKs still work.
Auto-reconnect for byte streams becomes opt-in for new runs only. Old runs keep current no-reconnect behavior. Consistent with how feature flags work elsewhere in the codebase.
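A sketch of the two decisions described above, under the assumption that run features are exposed as a list of strings. `resolveFraming`, `selectReader`, and `ReaderKind` are hypothetical names; only the `framing` field and the `'byte-stream-framing'` flag come from the proposal itself.

```typescript
type Framing = "raw" | "framed-v1";

// Shape of the serialized stream ref from the proposal.
interface StreamRef {
  name: string;
  type?: "bytes";
  startIndex?: number;
  framing?: Framing;
}

// Writer side: decide framing from the target run's features.
function resolveFraming(targetRunFeatures: readonly string[]): Framing {
  return targetRunFeatures.includes("byte-stream-framing") ? "framed-v1" : "raw";
}

type ReaderKind = "reconnecting-framed" | "plain-no-reconnect";

// Reader side: dispatch on the ref field. A missing field means raw,
// so refs serialized by older SDKs keep their current behavior.
function selectReader(ref: StreamRef): ReaderKind {
  return ref.framing === "framed-v1"
    ? "reconnecting-framed"
    : "plain-no-reconnect";
}
```

Because the choice is stored in the ref, the reader never has to look up the producer's version; it only trusts what the producer wrote.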

One implementation note

For start(workflow, args, { deploymentId }) with cross-deployment args, the args are serialized before the target run exists. The reducer needs a path to predict features for the target deployment without an actual run object — probably reading the deployment manifest's SDK version. Worth confirming this lookup is feasible at reducer-call time before committing to the design.

What's still open in the current PR

The architectural shift (frame counting in core, simpler world-vercel transport, AbortController plumbing) is good and should land. The two outstanding points from my prior review:

Byte streams are opted out of reconnect — addressed by the above.
"Clean EOF means done" assumption — still worth verifying explicitly. Either confirm that workflow-server's stream timeout now manifests as a network error (not a clean FIN), or document that this design only works if the server signals timeout-via-error.

The framing change for byte streams could be a follow-up PR if you want to keep the scope of this one tight, but the docs callout should at minimum be updated to clarify that byte streams currently lose auto-reconnect, distinct from the supportsCancellation issue (which is about cancellation, not reconnect).

TooTallNate

Approving — withdrawing my prior request-for-changes.

Context: my earlier blocker was that this PR opts byte streams out of auto-reconnect, which I called a regression vs. the now-reverted #1790. Since then we discussed it and settled on a different plan: this PR lands on stable as-is (object-stream reconnect only), and byte-stream support gets added on main/v5 via wire-level framing in a follow-up. The framing work is now in PRs #1854 (workflowCoreVersion on HealthCheckResult) and #1853 (the framing itself), which together let createReconnectingFramedStream be applied uniformly to byte streams on main once they land.

So for stable, this PR is the right scope:

Object-stream reconnect via createReconnectingFramedStream is correct.
Byte streams legitimately can't be auto-reconnected with the legacy unframed wire format that stable ships, so opting them out is the right call there.
Frame-counting math, AbortController plumbing, world-vercel simplification all look good.

The earlier non-blocking concerns I raised still apply — would be nice to address them but I'm not gating on them:

The "clean EOF means done" assumption. Worth a sentence in the commit/PR description confirming whether workflow-server's stream timeout now manifests as a network error rather than a clean FIN, since the deleted control-frame logic was specifically there to disambiguate them.
FRAMED_STREAM_MAX_RECONNECTS = 10 is tighter than the deleted MAX_RECONNECTS = 50. Probably fine, but worth a sanity check against the longest realistic Pro/Enterprise maxDuration.
Cancel race during reconnect — pre-existing, not a regression here.

Signed-off-by: Peter Wielander <mittgfu@gmail.com>

Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>

…l-at-getreadable-level

# Conflicts:
#	packages/world-vercel/src/streamer.test.ts
#	packages/world-vercel/src/streamer.ts

…preview

- serialization: reset reconnectCount to 0 when a reconnect delivers a frame, so FRAMED_STREAM_MAX_RECONNECTS bounds *consecutive* failures (as documented) instead of the lifetime total. Long-lived serverless streams that reconnect repeatedly but keep delivering no longer get falsely capped. Export the constant for tests.
- tests: add coverage for the max-consecutive-reconnect cap, the budget-reset-on-progress regression, and multi-frame-per-read drain.
- world-vercel: temporarily point WORKFLOW_SERVER_URL_OVERRIDE at the peter-stream-timeout-error workflow-server preview so this branch's e2e exercises the matching server-side stream-timeout behavior. To be cleared before merge (see comment).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The consecutive-failure cap resets on forward progress, which is correct for a backend that honors startIndex. But a World whose readFromStream ignored startIndex and re-delivered earlier chunks would report progress on every reconnect, so the consecutive cap would never trip — an unbounded reconnect loop. Add FRAMED_STREAM_MAX_TOTAL_RECONNECTS (1000), a hard ceiling that never resets, so the loop always terminates while staying far above any legitimate long-lived stream. Add a test covering the pathological ignore-startIndex case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
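The two caps from these commits can be sketched as a small budget object. `ReconnectBudget` is a hypothetical name, and whether the real code compares with `>=` or `>` is an assumption; the values 10 and 1000 are the constants quoted in this thread.

```typescript
class ReconnectBudget {
  private consecutive = 0;
  private total = 0;
  private readonly maxConsecutive: number;
  private readonly maxTotal: number;

  constructor(maxConsecutive = 10, maxTotal = 1000) {
    this.maxConsecutive = maxConsecutive;
    this.maxTotal = maxTotal;
  }

  // Forward progress resets only the consecutive counter.
  onFrameDelivered(): void {
    this.consecutive = 0;
  }

  // Returns false once either cap is exhausted. The total counter never
  // resets, so the loop terminates even if a backend ignores startIndex
  // and "delivers" a frame after every reconnect.
  tryReconnect(): boolean {
    if (this.consecutive >= this.maxConsecutive) return false;
    if (this.total >= this.maxTotal) return false;
    this.consecutive++;
    this.total++;
    return true;
  }
}
```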

VaguelySerious

AI review: blocking issues found

VaguelySerious · 2026-06-10T08:52:12Z

 */
-const WORKFLOW_SERVER_URL_OVERRIDE = '';
+const WORKFLOW_SERVER_URL_OVERRIDE =
+  'https://workflow-server-git-peter-stream-timeout-error.vercel.sh';


AI Review: Blocking

The hard-coded server override must be cleared back to '' before this merges — as written it points every consumer's stream traffic at a preview deployment. You've already flagged it as temporary and the two red CI signals are by-design, so this is just the merge gate: don't land until this constant is reset and those checks go green.

VaguelySerious · 2026-06-10T08:52:12Z

+      // fetch implementations differ on whether cancelling the body
+      // alone tears down the socket.
+      return new ReadableStream<Uint8Array>({
+        start(controller) {


AI Review: Note

The rewrap pumps the upstream eagerly in start() — the loop calls reader.read() → controller.enqueue(value) with no backpressure check, so it drains the upstream as fast as the socket delivers regardless of how fast the downstream consumer reads. The previous code returned response.body directly, which propagates backpressure to the socket. For a fast producer + slow consumer (exactly the long-lived streaming case this path serves), the wrapper can buffer the whole stream in the controller queue → unbounded memory.

Consider a pull-based source instead of an eager start() pump (read one chunk per pull, enqueue, return), or gate the pump on controller.desiredSize. That keeps cancel→abort propagation while preserving backpressure.
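
One possible shape for the pull-based alternative suggested above, as a sketch only. `wrapWithAbort` is a hypothetical helper name, and how the real `readFromStream` creates and wires its `AbortController` is an assumption here. The point is that each `pull` reads exactly one upstream chunk, so the upstream is only read when the wrapper's queue has room.

```typescript
// Hypothetical helper: wraps a response body so that cancelling the wrapper
// aborts the request, while reads stay demand-driven.
function wrapWithAbort(
  body: ReadableStream<Uint8Array>,
  abortController: AbortController
): ReadableStream<Uint8Array> {
  const reader = body.getReader();
  return new ReadableStream<Uint8Array>({
    // Called only when the wrapper's queue is below its high-water mark.
    async pull(controller) {
      const result = await reader.read(); // a rejection here errors the wrapper
      if (result.done) {
        controller.close();
        return;
      }
      controller.enqueue(result.value);
    },
    // Cancelling the wrapper aborts the fetch and cancels the upstream body.
    cancel(reason) {
      abortController.abort(reason);
      return reader.cancel(reason);
    },
  });
}
```

A caller would pass the same `AbortController` whose `signal` was given to `fetch`, so that `cancel()` on the returned stream tears down the request as well as the body.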

VaguelySerious · 2026-06-10T08:52:12Z

+ * hard ceiling guarantees the loop always terminates. It is set high enough
+ * (hours of streaming at realistic per-session timeouts) to never interfere
+ * with legitimate long-lived streams.
+ */


AI Review: Note

With the server now aborting in-progress streams at max-duration (rather than closing cleanly), the negative-startIndex branch becomes a behavior change worth calling out: those reads opt out of reconnect, so a last-N consumer that previously saw a clean EOF at the duration limit will now surface the abort as a hard error. The object-stream consumers that use negative indices (e.g. tail-resolving clients) resolve to an absolute index before connecting, so in practice they shouldn't hit this — but a doc line or a comment noting "negative startIndex + mid-stream server abort = error, not silent close" would save a future debugging session.

VaguelySerious · 2026-06-10T09:54:31Z

(AI) Cross-PR context & merge order

Together these make run.getReadable() transparently reconnect when a stream's underlying connection ends mid-stream (e.g. the periodic server-side max-duration cutoff), instead of surfacing a truncated stream to the consumer.

How the pieces fit:

[core] Move stream reconnect logic to getReadable level #1847 (→ stable) — adds a reconnecting reader at the getReadable/core level. It counts the length-prefixed frames it has received and, on a connection error, transparently reopens the stream from startIndex + framesConsumed. A clean end-of-stream is still treated as completion. Applies to object/serialized streams; raw byte streams are opted out (no wire framing to count).
[core] Forward-port stream reconnect to getReadable level #2318 (→ main) — the main forward-port of [core] Move stream reconnect logic to getReadable level #1847; same mechanism, adapted to main's stream-reading API.
[core] Add wire-level framing for byte streams #1853 (→ main) — opt-in wire framing for byte streams. Independent of the reconnect PRs and not required by them; it's the groundwork that lets a later follow-up extend the same frame-counting reconnect to byte streams.

Behavioural note: the reconnecting reader only reopens on a connection error — a clean close means "complete". So the client change is inert on its own and only takes effect once paired with the coordinated server-side change that ends a timed-out connection with an error rather than a silent close (handled separately). Shipping the client first is therefore safe and must precede that server change.
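
The contract described above (error means reopen from startIndex + framesConsumed, clean end means complete, negative startIndex opts out) can be sketched roughly as follows. This is not the PR's implementation: `OpenStream` and `readWithReconnect` are hypothetical names, the stream is modelled as an async iterable of already-delimited frames, and a single `maxReconnects` cap stands in for the real budget.

```typescript
// Hypothetical: opens a connection that yields whole frames starting at startIndex.
type OpenStream = (startIndex: number) => AsyncIterable<Uint8Array>;

async function* readWithReconnect(
  open: OpenStream,
  startIndex: number,
  maxReconnects = 50
): AsyncGenerator<Uint8Array> {
  // A negative (last-N) index has no absolute resume position, so no reconnect.
  if (startIndex < 0) {
    yield* open(startIndex);
    return;
  }
  let framesConsumed = 0;
  let reconnects = 0;
  for (;;) {
    try {
      for await (const frame of open(startIndex + framesConsumed)) {
        framesConsumed += 1;
        yield frame;
      }
      return; // clean end-of-stream: treated as completion
    } catch (err) {
      // Connection error: retry from the next unread frame, up to the cap.
      if (reconnects >= maxReconnects) throw err;
      reconnects += 1;
    }
  }
}
```

Because a clean close returns immediately, a server that ends timed-out connections silently never triggers the retry path, which is why the client change is inert until the server starts erroring instead.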

Suggested order:

[core] Move stream reconnect logic to getReadable level #1847 → stable (client reconnect; a no-op until the paired server change ships).
[core] Forward-port stream reconnect to getReadable level #2318 → main (same, on main). Release both so deployed apps gain reconnect.
The coordinated server-side timeout change — only after the reconnect-capable client is released.

#1853 is independent and can land on its own schedule; it unblocks byte-stream reconnect as a future follow-up.

TooTallNate

Approve — with one hard pre-merge gate (the URL override) and two non-blocking design notes

I built @workflow/core + @workflow/world-vercel from this branch and ran the new suites locally: all 10 reconnecting-framed-stream tests and all 20 streamer tests pass.

The design is right

Moving reconnect from the adapter's wire-sniffing control-frame approach (#1790, reverted here) up to the framing layer is the correct factoring. The framing layer is the one place that already knows chunk boundaries, so "count completed frames, resume at startIndex + consumed" needs no wire protocol additions at all — and it works identically for any World adapter, not just world-vercel.

Specifics I verified:

Partial-frame discard on reconnect is correct: buffered mid-frame bytes are dropped, currentStartIndex += consumedFrames, and the server resends the in-flight chunk in full. The test at line 140 simulates exactly the production scenario (3 bytes of a frame, then a max-duration abort) and asserts the resume index.
The two-tier budget is well-reasoned. Consecutive cap (50) resets on forward progress, so a long-lived stream that reconnects hundreds of times while still delivering is never falsely killed — tested with FRAMED_STREAM_MAX_RECONNECTS + 5 productive reconnects. The absolute backstop (1000) covers the pathological case where a backend ignores startIndex and reports false progress forever — also tested. Both constants exported and documented with the reasoning.
Clean EOF = completion, error = reconnect is the right contract, and negative startIndex (last-N) correctly opts out since an absolute resume position can't be computed.
Frames pass through with headers intact to the downstream deserializer, which already expects the framed layout — the wrapper only counts, it doesn't re-frame. Nice and minimal.
The byte-stream opt-out is correctly scoped (raw streams have no framing to count) and the doc callouts about supportsCancellation for long-lived stream routes are a genuinely useful addition independent of this change.
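
To make the partial-frame discard concrete, here is a small sketch of frame counting over 4-byte length-prefixed frames. The names (`FrameCounter`, `encodeFrame`, `discardPartial`) are hypothetical, and the big-endian length prefix is an assumption; the thread only states that the prefix is 4 bytes.

```typescript
// Assumption: each frame is a 4-byte big-endian length followed by that many payload bytes.
function encodeFrame(payload: Uint8Array): Uint8Array {
  const out = new Uint8Array(4 + payload.length);
  new DataView(out.buffer).setUint32(0, payload.length, false);
  out.set(payload, 4);
  return out;
}

class FrameCounter {
  private buffer: Uint8Array = new Uint8Array(0);
  completed = 0; // frames fully received so far

  /** Feeds one read's worth of bytes; returns how many frames it completed. */
  push(chunk: Uint8Array): number {
    const merged = new Uint8Array(this.buffer.length + chunk.length);
    merged.set(this.buffer, 0);
    merged.set(chunk, this.buffer.length);
    const view = new DataView(merged.buffer);
    let offset = 0;
    let newly = 0;
    // Drain every complete frame; a single read may contain several.
    while (merged.length - offset >= 4) {
      const len = view.getUint32(offset, false);
      if (merged.length - offset < 4 + len) break; // frame still incomplete
      offset += 4 + len;
      newly += 1;
    }
    this.buffer = merged.subarray(offset); // keep only the partial tail
    this.completed += newly;
    return newly;
  }

  /** On reconnect the server resends the in-flight frame in full, so drop the tail. */
  discardPartial(): void {
    this.buffer = new Uint8Array(0);
  }
}
```

After a connection error the resume position would be `startIndex + counter.completed`, with `discardPartial()` called before bytes from the new connection are pushed.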

Hard gate before merge

WORKFLOW_SERVER_URL_OVERRIDE is pointed at a preview deployment. The in-code comment documents this as temporary and correctly predicts the red CI (the override lint guard + the 4 utils.test.ts override-precedence cases — I checked the failing unit job and those 4 are precisely the failures). This must go back to '' before merge, and CI needs one green re-run after the reset — the current Tests run is also three weeks old (May 29) relative to the branch head, and the MongoDB/Redis community-world results from that run are stale enough that I wouldn't sign off on them either way without a fresh run.

Non-blocking notes

A reconnect-time connection failure is fatal rather than budgeted. The retry budget only covers reader.read() errors. If reconnect() → connect() itself throws (the reopen fetch fails transiently — plausible during exactly the kind of server blip that triggers reconnect in the first place), the catch in pull errors the stream immediately with budget remaining. Folding connect failures into the same budgeted loop (count it, retry) would make the wrapper robust against the scenario it exists for. Fine as a follow-up.
The streamer's cancel-propagation wrapper trades away backpressure. The eager pump() loop in readFromStream reads upstream as fast as the network delivers and enqueues without consulting desiredSize, so a slow consumer now buffers the stream in the controller queue instead of letting the socket backpressure naturally (the old code returned response.body directly, which is pull-driven). A pull-based wrapper would keep the AbortController plumbing and preserve backpressure. For typical workflow stream sizes this is unlikely to matter, but it's an unnecessary semantic change for what is otherwise just abort plumbing.
Changeset bump types (patch for both packages on the GA channel) are defensible since this fixes silent truncation, even though it adds new behavior.

Once the override is reset and CI is green on a current run, this is good to land. The cross-PR sequencing in the description (this merging and releasing before the coordinated server-side behavior change takes effect) is the right order — until the server change ships, this code path simply never triggers, which makes it safe to release ahead.

VaguelySerious · 2026-06-10T18:42:17Z

+ *
+ * NOTE (temporary): this is intentionally pointed at the
+ * `peter-stream-timeout-error` workflow-server preview so this branch's e2e
+ * tests exercise the matching server-side stream-timeout behavior. It will be
+ * cleared back to '' once those server-side changes merge — not a review
+ * concern. While it is set, two CI signals are red by design and will go
+ * green again on reset: the "WORKFLOW_SERVER_URL_OVERRIDE is empty" lint
+ * guard, and the override-precedence cases in `utils.test.ts` (the hardcoded
+ * value intentionally wins over the env var, which those cases assert is
+ * absent).
 */
-const WORKFLOW_SERVER_URL_OVERRIDE = '';
+const WORKFLOW_SERVER_URL_OVERRIDE =
+  'https://workflow-server-git-peter-stream-timeout-error.vercel.sh';


Suggested change

- *
- * NOTE (temporary): this is intentionally pointed at the
- * `peter-stream-timeout-error` workflow-server preview so this branch's e2e
- * tests exercise the matching server-side stream-timeout behavior. It will be
- * cleared back to '' once those server-side changes merge — not a review
- * concern. While it is set, two CI signals are red by design and will go
- * green again on reset: the "WORKFLOW_SERVER_URL_OVERRIDE is empty" lint
- * guard, and the override-precedence cases in `utils.test.ts` (the hardcoded
- * value intentionally wins over the env var, which those cases assert is
- * absent).
  */
-const WORKFLOW_SERVER_URL_OVERRIDE =
-  'https://workflow-server-git-peter-stream-timeout-error.vercel.sh';
+const WORKFLOW_SERVER_URL_OVERRIDE = '';

Reverts the temporary preview override (and its NOTE) so utils.ts has no diff. The matching server-side stream-timeout behavior is validated via its own PR; the SDK override must stay empty (lint guard enforces it). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

VaguelySerious added 2 commits April 23, 2026 15:18

Revert "[world-vercel] Use stream control frame for transparent recon…

0dc1546

…nection (#1790)" This reverts commit 5ef9ac2.

[core] Move stream reconnect logic to getReadable level

2f9db3c

Signed-off-by: Peter Wielander <mittgfu@gmail.com>

VaguelySerious changed the base branch from main to stable April 23, 2026 22:56

Add changeset for stream reconnect

0c582b0

docs callouts

39a3b26

Signed-off-by: Peter Wielander <mittgfu@gmail.com>

TooTallNate requested changes Apr 25, 2026

TooTallNate mentioned this pull request Apr 25, 2026

[core] Add wire-level framing for byte streams #1853

Open

This was referenced Apr 25, 2026

world-postgres: stream readers can stall after LISTEN disconnects or missed NOTIFY event #1855

Open

Clean up world-postgres LISTEN self-healing for upstream PR Pom4H/workflow#1

Closed

TooTallNate approved these changes Apr 27, 2026

10 -> 50

5675329

Signed-off-by: Peter Wielander <mittgfu@gmail.com>

VaguelySerious commented Apr 29, 2026

Comment thread docs/content/docs/deploying/world/vercel-world.mdx Outdated

VaguelySerious commented Apr 29, 2026

Comment thread docs/content/docs/foundations/streaming.mdx Outdated

Apply suggestions from code review

e58a449

Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Peter Wielander <mittgfu@gmail.com>

TooTallNate mentioned this pull request May 2, 2026

[world-vercel] Revert stream control framing (stable) #1892

Merged

shtefcs mentioned this pull request May 12, 2026

WorkflowChatTransport.reconnectToStreamIterator logs console.error on intentional client abort #1971

Open

VaguelySerious and others added 3 commits May 29, 2026 15:47

Merge remote-tracking branch 'origin/stable' into peter/stream-contro…

9f4d1b1

…l-at-getreadable-level # Conflicts: # packages/world-vercel/src/streamer.test.ts # packages/world-vercel/src/streamer.ts

VaguelySerious commented Jun 10, 2026

VaguelySerious mentioned this pull request Jun 10, 2026

[core] Forward-port stream reconnect to getReadable level #2318

Merged

TooTallNate approved these changes Jun 10, 2026

VaguelySerious commented Jun 10, 2026

VaguelySerious mentioned this pull request Jun 10, 2026

[core] Retry stream reopen against the reconnect budget #2334

Open

Merge branch 'stable' into peter/stream-control-at-getreadable-level

4a0258b

github-actions Bot mentioned this pull request Jun 11, 2026

Version Packages #2352

Open

github-actions Bot commented Apr 23, 2026

🧪 E2E Test Results

Summary

❌ Failed Tests

Details by Category

Check the workflow run for details.

TooTallNate left a comment

Review

1. Byte streams lose auto-reconnect entirely

2. The "clean EOF means done" assumption needs verification

Minor

What looks good

TooTallNate commented Apr 25, 2026

Recommended direction

1. Frame byte streams on the writer side

2. Use createReconnectingFramedStream for both branches on the reader side

3. WHATWG type: 'bytes' semantics are unaffected

Backwards compatibility

Cross-version exposures (post-version-skew-protection)

Proposed mechanism

One implementation note

What's still open in the current PR
